Why you need to develop computational and analytical skills
Opportunities for biologists to undertake bioinformatic analysis
Challenges you may face
Three real-life examples of biologists developing bioinformatics skills at the CRUK Cambridge Institute
Head of Bioinformatics Core, CRUK Cambridge Institute
Technological advances have accelerated scientific discovery but place increasing demands on data handling and analysis capabilities.
Computational tools and workflows are increasingly designed for use by biologists.
Programming languages such as R are now easier to use
Drive for openness, transparency and reproducibility
Web-based platforms
Cloud-based computing
Containers
Singularity
The aim is to find the countries with the lowest population densities.
How would you do this? What steps are involved?
# load tidyverse packages containing the functions we'll need
library(readxl)
library(dplyr)
# read population spreadsheet into R and change column headers
populations <- read_excel("country_data.xlsx", sheet = 1, skip = 1)
colnames(populations) <- c("country", "population")
# read land area spreadsheet into R
areas <- read_excel("country_data.xlsx", sheet = 2)
colnames(areas) <- c("country", "total_area", "water_area", "notes")
# calculate land area
areas <- mutate(areas, land_area = total_area - water_area)
# combine population and area tables
combined_data <- full_join(populations, areas, by = "country")
# calculate population density
combined_data <- mutate(combined_data, density = population / land_area)
# sort into order of population density, filter those with the lowest values
# and select columns for display
combined_data %>%
  arrange(density) %>%
  filter(density < 5.0) %>%
  select(country, population, total_area, water_area, land_area, density, notes)
## # A tibble: 6 x 7
##   country  population total_area water_area land_area density notes
##   <chr>         <dbl>      <dbl>      <dbl>     <dbl>   <dbl> <chr>
## 1 Mongolia    3027398    1564110      10560   1553550    1.95 <NA>
## 2 Namibia     2479713     825615       2425    823190    3.01 <NA>
## 3 Austral…   24125848    7692024      58459   7633565    3.16 The largest country in Ocea…
## 4 Iceland      332474     103000       2750    100250    3.32 <NA>
## 5 Libya       6293253    1759540          0   1759540    3.58 <NA>
## 6 Canada     36289822    9984670     891163   9093507    3.99 Largest country in the West…
The dplyr package provides a useful set of functions for manipulating, combining and filtering tabular data.
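The same verbs work on any data frame. A minimal, self-contained sketch using invented toy values (not real country statistics) shows the pattern:

```r
# load dplyr for the data manipulation verbs used below
library(dplyr)

# toy data invented purely for illustration
toy <- tibble(
  country    = c("A", "B", "C"),
  population = c(100, 4000, 250),
  land_area  = c(50, 400, 10)
)

toy %>%
  mutate(density = population / land_area) %>%  # add a computed column
  filter(density < 20) %>%                      # keep only low-density rows
  arrange(density)                              # sort from lowest to highest
# keeps countries A (density 2) and B (density 10)
```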
Multispectral optoacoustic tomography
Image courtesy of Isabel Quiroz Gonzales
Isabel is combining selected measurements from multiple imaging runs and filtering for interesting results.
Opportunities
Automation
Reuse
Reproducibility
Challenges
Learning R can be hard
There are many ways of achieving the same thing in R, but which one should you use?
You know what you want to do, but how do you find the right function?
Mathilde is monitoring the weight of mice at various timepoints and wants to plot the weight, scaled by the maximum weight of each mouse, by age.
Image courtesy of Mathilde Colombe
A common problem – the format used to collect the data is not suitable for analysis and visualization.
We need to reformat the data into a tidy form.
Step 1: Read the data into R in its current wide format
library(tidyverse)
data <- read_csv("muc2_weight.csv")
data
## # A tibble: 37 x 65
##   ID    `Cage number` Sex   DOB   `Initial weight` `25/09/2017` `02/10/2017` `09/10/2017`
##   <chr> <chr>         <chr> <chr>            <dbl>        <dbl>        <dbl>        <dbl>
## 1 AN17… 13536         M     01/0…             32.5         32.5         32.7         33.6
## 2 AN17… 9381          F     10/0…             28.4         28.4         27.5         29.5
## 3 18/1… 347           F     19/1…             11.4           NA           NA           NA
## 4 18/1… 347           F     19/1…             12.5           NA           NA           NA
## 5 18/1… 348           M     19/1…             12.5           NA           NA           NA
## 6 18/1… 349           M     19/1…             14.1           NA           NA           NA
## # … with 31 more rows, and 57 more variables: `16/10/2017` <dbl>, `24/10/2017` <dbl>,
## #   `31/10/2017` <dbl>, `07/11/2017` <dbl>, `14/11/2017` <dbl>, `21/11/2017` <dbl>,
## #   `28/11/2017` <dbl>, `05/12/2017` <dbl>, `12/12/2017` <dbl>, `19/12/2017` <dbl>,
## #   `27/12/2017` <dbl>, `02/01/2018` <dbl>, `09/01/2018` <dbl>, `16/01/2018` <dbl>,
## #   `23/01/2018` <dbl>, `30/01/2018` <dbl>, `06/02/2018` <dbl>, `13/02/2018` <dbl>,
## #   `20/02/2018` <dbl>, `27/02/2018` <dbl>, `06/03/2018` <dbl>, `13/03/2018` <dbl>,
## #   `20/03/2018` <dbl>, `27/03/2018` <dbl>, `03/04/2018` <dbl>, `18/04/2018` <dbl>,
## #   `25/04/2018` <dbl>, `02/05/2018` <dbl>, `10/05/2018` <dbl>, `17/05/2018` <dbl>,
## #   `24/05/2018` <dbl>, `30/05/2018` <dbl>, `06/06/2018` <dbl>, `14/06/2018` <dbl>,
## #   `20/06/2018` <dbl>, `27/06/2018` <dbl>, `04/07/2018` <dbl>, `11/07/2018` <dbl>,
## #   `18/07/2018` <dbl>, `25/07/2018` <dbl>, `01/08/2018` <dbl>, `08/08/2018` <dbl>,
## #   `15/08/2018` <dbl>, `22/08/2018` <dbl>, `29/08/2018` <dbl>, `05/09/2018` <dbl>,
## #   `12/09/2018` <dbl>, `19/09/2018` <dbl>, `26/09/2018` <dbl>, `03/10/2018` <dbl>,
## #   `10/10/2018` <dbl>, `17/10/2018` <dbl>, `24/10/2018` <dbl>, `31/10/2018` <dbl>,
## #   `07/11/2018` <dbl>, `14/11/2018` <dbl>, `21/11/2018` <dbl>
Step 2: Convert from wide format to long (or tidy) format
data <- pivot_longer(data, cols = 6:65, names_to = "Date", values_to = "Weight", values_drop_na = TRUE)
data
## # A tibble: 594 x 7
##   ID              `Cage number` Sex   DOB        `Initial weight` Date       Weight
##   <chr>           <chr>         <chr> <chr>                 <dbl> <chr>       <dbl>
## 1 AN17/16957 (1L) 13536         M     01/05/2017             32.5 25/09/2017   32.5
## 2 AN17/16957 (1L) 13536         M     01/05/2017             32.5 02/10/2017   32.7
## 3 AN17/16957 (1L) 13536         M     01/05/2017             32.5 09/10/2017   33.6
## 4 AN17/16957 (1L) 13536         M     01/05/2017             32.5 16/10/2017   32.3
## 5 AN17/16957 (1L) 13536         M     01/05/2017             32.5 24/10/2017   32.1
## 6 AN17/16957 (1L) 13536         M     01/05/2017             32.5 31/10/2017   33.4
## # … with 588 more rows
The tidyr package provides functions to transform your data into a tidy format.
Step 3: Convert dates and calculate age
library(lubridate)
data <- mutate(data, DOB = dmy(DOB), Date = dmy(Date), Age = Date - DOB)
select(data, ID, DOB, Date, Age, Weight)
## # A tibble: 594 x 5
##   ID              DOB        Date       Age      Weight
##   <chr>           <date>     <date>     <drtn>    <dbl>
## 1 AN17/16957 (1L) 2017-05-01 2017-09-25 147 days   32.5
## 2 AN17/16957 (1L) 2017-05-01 2017-10-02 154 days   32.7
## 3 AN17/16957 (1L) 2017-05-01 2017-10-09 161 days   33.6
## 4 AN17/16957 (1L) 2017-05-01 2017-10-16 168 days   32.3
## 5 AN17/16957 (1L) 2017-05-01 2017-10-24 176 days   32.1
## 6 AN17/16957 (1L) 2017-05-01 2017-10-31 183 days   33.4
## # … with 588 more rows
The dplyr package provides functions for data manipulation that work together in a consistent and coherent manner.
Step 4: Scale the weights by the maximum recorded weight for each individual
data <- data %>%
  group_by(ID) %>%
  mutate(ScaledWeight = Weight / max(Weight)) %>%
  ungroup()
select(data, ID, DOB, Date, Age, Weight, ScaledWeight)
## # A tibble: 594 x 6
##   ID              DOB        Date       Age      Weight ScaledWeight
##   <chr>           <date>     <date>     <drtn>    <dbl>        <dbl>
## 1 AN17/16957 (1L) 2017-05-01 2017-09-25 147 days   32.5        0.923
## 2 AN17/16957 (1L) 2017-05-01 2017-10-02 154 days   32.7        0.929
## 3 AN17/16957 (1L) 2017-05-01 2017-10-09 161 days   33.6        0.955
## 4 AN17/16957 (1L) 2017-05-01 2017-10-16 168 days   32.3        0.918
## 5 AN17/16957 (1L) 2017-05-01 2017-10-24 176 days   32.1        0.912
## 6 AN17/16957 (1L) 2017-05-01 2017-10-31 183 days   33.4        0.949
## # … with 588 more rows
'%>%' pipes the output of one operation into the next.
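As a small illustration of why the pipe aids readability, the two expressions below are equivalent; the piped version reads left to right, in the order the operations actually happen:

```r
library(magrittr)  # provides %>% (also loaded with dplyr and the tidyverse)

round(sqrt(sum(c(1, 4, 11))), 1)               # nested calls: read inside-out
c(1, 4, 11) %>% sum() %>% sqrt() %>% round(1)  # piped: read left-to-right
# both return 4
```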
Step 5: Create weight vs age plot
ggplot(data = data, mapping = aes(x = Age, y = ScaledWeight, colour = ID)) + geom_line()
Step 5: Create separate weight vs age plots
ggplot(data = data, mapping = aes(x = Age, y = ScaledWeight, colour = ID)) + geom_line() + facet_wrap(vars(ID))
Change representation to a box plot
ggplot(data = data, mapping = aes(x = ID, y = Weight, colour = Sex)) + geom_boxplot()
Opportunities
Exploratory data analysis
Plotting is a useful way to explore and understand your data
Learning the ggplot2 'grammar of graphics' means you'll be able to create a wide range of plots using a common syntax
Reuse
Publication quality graphics
Challenges
R has a steep learning curve
ggplot2 is relatively straightforward once you get the hang of it, but customizing plots to look exactly as you want can be tricky
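To give a flavour of that customization, the sketch below uses R's built-in iris data set (so it is self-contained; the mouse weight data are not reproduced here) with a few commonly used layers:

```r
library(ggplot2)

# the built-in iris data set keeps this sketch self-contained
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point() +
  labs(title = "Sepal dimensions by species",
       x = "Sepal length (cm)",
       y = "Sepal width (cm)") +
  theme_minimal() +                  # replace the default grey theme
  theme(legend.position = "bottom")  # move the legend below the plot
```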
Carolin is investigating the relationship between copy number alterations and centrosome instability in ovarian carcinomas and is processing large numbers of samples using shallow whole genome sequencing.
Courtesy of Carolin Sauer
Opportunity
The Brenton lab need to be able to run the analysis themselves as and when data are available
Problem
Solution
Opportunities
Access to large-scale data processing
Systematic approach for large cohort studies
Biologists don't have to wait for a bioinformatician to process their data, so they obtain results sooner
Accelerated pipeline development
Off-the-shelf analysis packages and pipelines
Challenges
Unix command line
Use of high-performance compute clusters or cloud computing
Workflow engines add a level of complexity
Troubleshooting failures in pipeline runs is not for the faint-hearted
Learning R will empower you to:
explore, analyze and visualize your data more effectively
handle repetitive and error-prone tasks efficiently
create elegant reports that combine your code, results, plots and narrative text
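To give a flavour of such reports, a minimal R Markdown skeleton might look like the sketch below (the title and chunk label are invented for illustration); knitting it produces a document in which the code, its results, plots and the surrounding narrative appear together:

````markdown
---
title: "Mouse weight analysis"
output: html_document
---

Narrative text describing the analysis goes here.

```{r weight-plot, message=FALSE}
# code chunks run when the document is knitted and their
# results and plots are embedded in the rendered report
library(tidyverse)
```
````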